Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification
نویسندگان
چکیده
We address in this paper the problem of task-independent video sequence classification. We aim to introduce a generic model which differ from the highly problem-dependent dominant methodology that relies on so-called hand-crafted features. We propose a learning-based model with two main steps: the first one aims to automatically learn spatio-temporal features instead of hand-crafting them. These learned features are sparseovercomplete, i.e. their dimension is larger than the input one, but only a small number of components are non-zero. The second step consists in labelizing the entire video sequence considering the temporal evolution of the learned features. The first learning step is performed in an unsupervised way, and is based on spatio-temporal convolutional sparse autoencoders (which will be introduced hereafter), and the second consists in a supervised classification using recurrent neural networks. The feature learning process is based on an auto-encoder scheme: an encoder which builds a non-sparse code vector representing the spatiotemporal salient information contained in the input, and a decoder which learns to reconstruct the input from a sparse version of the obtained code (see Figure 1). The model takes as input small space-time patches in order to reduce the diversity of the content to be encoded, since the patterns are locally less variable than if the full frame was considered. The encoder is a convolutional neural network with 2D+ t convolution kernels (each one having the same size than the input patch). The decoder consists in a set of output neurons fully connected to the sparse code layer. The sparsity is obtained using the sparsifying logistic proposed by Ranzato et al. [2], placed between the encoder and the decoder. This model is associated to a global objective function, which is the sum of two terms, representing respectively the encoder prediction and the decoder reconstruction mean square errors, and which is minimized during training. In order to handle the spatial and temporal shift-invariance of the learned representations, a “best shift search” module is introduced before the auto-encoder (see Figure 1). The idea is to represent the spatiotemporal neighbourhood of a given input patch by a single “shifted” patch, which is the one minimizing the objective function, given the current set of parameters. To that aim, an additional hidden variable is introduced, the translation vector, on which the optimization is done. To avoid encoding non-relevant patterns (e.g. colour and texture), the model is trained only with the patches containing significant spatiotemporal information (according to a motion-based selection critirea). This plays the same role as the saliency detectors in the case of the handcrafted features. Entire sequences are finally labeled with a particular recurrent neural network classifier, namely Long Short-Term Memory Recurrent Neural Network (LSTM) [1], in order to take benefits of its ability to use the temporal evolution of features for classification. The LSTM classifier takes as input a sequence of feature vectors, each one corresponding to the concatenated responses of the patches placed at the grid of possible locations in each frame. Aiming at verifying the genericity of the proposed model, experiments were carried out on two different problems: human actions and facial expression recognition. For the first experiment, we used the standard KTH human actions dataset [3]. To our knowledge, our method obtains the best results among the methods using automatically learned features, both on the two versions of the KTH dataset (95.83% for KTH1 and 93.74% for KTH2). More generally, we obtain the second best result for KTH1, and the third for KTH2, even when compared with approaches relying on hand-crafted features designed for the KTH dataset. Figure 1: Overview of the proposed spatio-temporal convolutional sparse auto-encoder: Illustration on a sample from the GEMEP-FERA facial expressions dataset.
منابع مشابه
Spatial-spectral Classification Based on the Unsupervised Convolutional Sparse Auto-encoder for Hyperspectral Remote Sensing Imagery
Current hyperspectral remote sensing imagery spatial-spectral classification methods mainly consider concatenating the spectral information vectors and spatial information vectors together. However, the combined spatial-spectral information vectors may cause information loss and concatenation deficiency for the classification task. To efficiently represent the spatial-spectral feature informati...
متن کاملA Particle Swarm Optimization-based Flexible Convolutional Auto-Encoder for Image Classification
Convolutional auto-encoders have shown their remarkable performance in stacking to deep convolutional neural networks for classifying image data during past several years. However, they are unable to construct the state-of-the-art convolutional neural networks due to their intrinsic architectures. In this regard, we propose a flexible convolutional auto-encoder by eliminating the constraints on...
متن کاملCrop Land Change Monitoring Based on Deep Learning Algorithm Using Multi-temporal Hyperspectral Images
Change detection is done with the purpose of analyzing two or more images of a region that has been obtained at different times which is Generally one of the most important applications of satellite imagery is urban development, environmental inspection, agricultural monitoring, hazard assessment, and natural disaster. The purpose of using deep learning algorithms, in particular, convolutional ...
متن کاملBLIND PARAMETER ESTIMATION OF A RATE k/n CONVOLUTIONAL CODE IN NOISELESS CASE
This paper concerns to blind identification of a convolutional code with desired rate in a noiseless transmission scenario. To the best of our knowledge, blind estimation of convolutional code based on only the received bitstream doesn’t lead to a unique solution. Hence, without loss of generality, we will assume that the transmitter employs a non-catastrophic encoder. Moreover, we consider a c...
متن کاملA Deep Convolutional Auto-Encoder with Pooling - Unpooling Layers in Caffe
This paper presents the development of several models of a deep convolutional auto-encoder in the Caffe deep learning framework and their experimental evaluation on the example of MNIST dataset. We have created five models of a convolutional auto-encoder which differ architecturally by the presence or absence of pooling and unpooling layers in the auto-encoder’s encoder and decoder parts. Our r...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012